20/09/2018

Welcome to Data Handling: I.C.V. 2018!

Introductory Example

Data input, processing, output

The Data Pipeline

Data Science workflow. Source: Wickham and Grolemund (2017)

What could be the output of all this?

  • Research report/paper (e.g., BA Thesis)
  • Presentation/Slides
  • Website
  • Web application (interactive, as in the introductory example)
  • Dashboard for management
  • Recommender system (i.e., a trained machine learning algorithm)

'Data Science'?

"This coupling of scientific discovery and practice involves the collection, management, processing, analysis, visualization, and interpretation of vast amounts of heterogeneous data associated with a diverse array of scientific, translational, and inter-disciplinary applications."

University of Michigan 'Data Science Initiative', 2015

But, what about Statistics?!

"Seemingly, statistics is being marginalized here; the implicit message is that statistics is a part of what goes on in data science but not a very big part. At the same time, many of the concrete descriptions of what the DSI will actually do will seem to statisticians to be bread-and-butter statistics. Statistics is apparently the word that dare not speak its name in connection with such an initiative!"

David Donoho (2015). 50 Years of Data Science

Background

What's new about all this?

"All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."

John Tukey (The Future of Data Analysis, 1962!)

Technological change

Source: statista.com.

Top: Number of mentions of the terms 'Big Data' or 'Artificial Intelligence' in academic and media sources, 1980-2016. Bottom: Number of mentions in The New York Times and The Wall Street Journal, used as proxies for U.S. mainstream media and business media. Note logarithmic y-axis scale. Source: Katz (2017).

Organization of the Course

Help Wanted

  • Experienced R user?
  • Assist fellow students during exercises in class
  • Disclaimer: this is not an official TA position!
  • Interested?

Course evaluation: Focus Group

Course Structure

Course Concept

  • Lectures (every Thursday morning)
    • Background/Concepts
    • Live demonstrations of concepts
    • Illustration of 'hands-on' approaches
  • Workshops/Exercises (bi-weekly evening sessions)
    • Guided tutorials
    • Discussion of homework exercises
    • Recap of theoretical concepts
    • The first exercise set (setting up R/RStudio) is available on StudyNet today
  • Guest Lectures

Course Concept

  • Strongly encouraged: Learning groups!
    • The workshop/exercise sessions will provide the opportunity.
    • Tackle the tricky exercises together!

15/11/2018: Guest Lecture by Dr. Michael Zehnder

Michael Zehnder, PhD, Trium EMBA
Co-Founder & CEO Swiss Data Labs AG

Part I: Data (Science) Fundamentals

Date Topic
20.09.2018 Introduction: Big Data/Data Science, course overview
27.09.2018 An introduction to data and data processing
27.09.2018 Exercises/Workshop 1: Tools, working with text files
04.10.2018 Data storage and data structures
11.10.2018 'Big Data' from the Web
11.10.2018 Exercises/Workshop 2: Computer code and data storage

Part II: Data Gathering and Preparation

Date Topic
18.10.2018 Programming with data
25.10.2018 Data sources, data gathering, data import
25.10.2018 Exercises/Workshop 3: Programming with Data
15.11.2018 Guest Lecture: Dr. Michael Zehnder (Swiss Data Labs, gateB)
22.11.2018 Data preparation and manipulation
22.11.2018 Exercises/Workshop 4: Data import and data preparation/manipulation
29.11.2018 Research Insights: The Programmable Web, Big Public Data, and Political Economics

Part III: Analysis, Visualisation, Output

Date Topic
06.12.2018 Basic statistics and data analysis with R
06.12.2018 Exercises/Workshop 5: Applied data analysis with R
13.12.2018 Visualization, dynamic documents
20.12.2018 Exercises/Workshop 6: Visualization, dynamic documents; Wrap-Up, Q&A
20.12.2018 Exam for exchange students

Core Course Resources

Main textbooks

Further resources

Exam Information

  • Central, written examination.
  • Multiple choice questions.
  • Theoretical concepts and practical applications in R (questions based on code examples).

Exam Information II

  • Exercises towards the end of the term will contain sample questions.
    • Get familiar with the style of the questions.
  • Exchange students who need to take the exam before the central exam block:
    • Notify me by the end of September: ulrich.matter@unisg.ch
    • Option to take a decentral exam during the term (probably on the day of the last lecture).

Q&A

References